Community Detection in Complex Networks Using Genetic Algorithms 
Community detection is an important research topic in complex networks. We present a
genetic algorithm that detects communities in complex networks by optimizing network
modularity. It does not need any prior knowledge of the number of communities. Its
performance is tested on two real-life networks with known community structures and on a
set of synthetic networks. As the performance measure, an information theoretical metric,
the variation of information, is used. The results are promising and in some cases better
than those of previously reported studies.



I. INTRODUCTION 



Community structure detection is one of the topics that has attracted great interest in
complex network studies. A community is loosely defined as a group of vertices with a high
density of in-group edges and a low density of out-group edges. Identifying communities in
complex networks calls for techniques borrowed from physics and computer science [1, 2, 3].

Different methods and algorithms have been proposed to reveal the underlying community
structure in complex networks. A crucial part of each algorithm is how it defines a
community [4]. There are different formal definitions of a community [5]. In this paper, we
use a quantitative definition proposed by Girvan and Newman, which makes use of a measure
called network modularity [6]. The network modularity $Q$ is defined as



$$Q = \sum_i \left( e_{ii} - a_i^2 \right) \qquad (1)$$



where the index $i$ runs over all communities, $e_{ii}$ is the fraction of edges that
connect two nodes within group $i$, while $a_i$ is the fraction of edges that have at least
one endpoint within the group. Some recent community detection algorithms, such as
Newman's fast algorithm for detecting communities, the algorithm for very large networks,
and the algorithm using Extremal Optimization, use the network modularity as a quality
metric 0, [j, 0|. Calculating the network modularity is less time consuming than
calculating the edge betweenness centrality used in the Girvan-Newman (GN) algorithm [6].
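As an illustration of Eq. (1), the following minimal Python sketch computes $Q$ for an unweighted, undirected graph. The function name `modularity`, the edge-list input and the 0-based vertex indices are assumptions made for this example and do not come from the paper. The sketch takes $a_i$ to be the fraction of edge endpoints that fall in community $i$, which is how the Newman-Girvan definition [6] is usually computed.

```python
from collections import defaultdict


def modularity(edges, labels):
    """Q = sum_i (e_ii - a_i**2) for an undirected, unweighted graph.

    edges  : list of (u, v) pairs with 0-based vertex indices (assumed format)
    labels : labels[v] is the community label of vertex v
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)   # edges with both endpoints in community i
    endpoints = defaultdict(int)  # edge endpoints that fall in community i
    for u, v in edges:
        endpoints[labels[u]] += 1
        endpoints[labels[v]] += 1
        if labels[u] == labels[v]:
            internal[labels[u]] += 1
    # e_ii = internal / m, a_i = endpoints / (2m)
    return sum(internal[c] / m - (endpoints[c] / (2 * m)) ** 2
               for c in endpoints)


# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(edges, [0, 0, 0, 1, 1, 1])
q_single = modularity(edges, [0, 0, 0, 0, 0, 0])
```

For this toy graph, splitting into the two triangles gives $Q = 5/14 \approx 0.357$, while putting all vertices in one community gives $Q = 0$.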

In this paper, we propose a new community detection algorithm which tries to optimize the
network modularity by employing a genetic algorithm. Unlike the previous methods, the new
algorithm does not require the number of communities present in a graph to be specified.
The number of communities emerges as a result of optimizing the modularity value.



II. BACKGROUND 

Our algorithm is based on the optimization of network modularity $Q$ by employing a genetic
algorithm. The algorithm produces a partition of the set of vertices of the graph. The
success of a community detection algorithm is defined as the closeness of the partition
generated by the algorithm to the partition corresponding to the real community structure.
Therefore, before describing the algorithm, let us introduce the terminology on partitions
that we will use throughout this paper.



A. Representation 

We are given an undirected graph $G(V, E)$ with $|V| = n$ vertices and $|E| = e$ edges.
Assuming a fixed ordering of the vertices $V = \{v_1, v_2, \ldots, v_n\}$, any partition
$\Omega$ of $V$ can be represented by a vector $\kappa = [\kappa^1\ \kappa^2\ \cdots\ \kappa^n]$
of $n$ dimensions, where $\kappa^i \in \{1, 2, \ldots, |\Omega|\}$ is the index of the
cluster of the vertex $v_i$. The vertices $v_i$ and $v_j$ are in the same cluster if and
only if $\kappa^i = \kappa^j$. Note that this is not a canonical representation, meaning
that there are many different vectors corresponding to the same partition. For example, let
$V = \{v_1, v_2, v_3, v_4\}$ be the set of vertices in the given order and consider the
partition $\Omega = \{\{v_1, v_3\}, \{v_2\}, \{v_4\}\}$. Then, although they are different,
the vectors $\kappa_1 = [1\ 2\ 1\ 3]$, $\kappa_2 = [4\ 2\ 4\ 3]$ and
$\kappa_3 = [2\ 1\ 2\ 4]$ all represent this partition.
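To make the non-canonical nature of the encoding concrete, the sketch below relabels a vector so that clusters are numbered in order of first appearance. The helper `canonical` is hypothetical and is not part of the algorithm in this paper; it only shows that the three example vectors describe the same partition.

```python
def canonical(kappa):
    """Relabel clusters as 1, 2, ... in order of first appearance."""
    relabel = {}
    out = []
    for label in kappa:
        if label not in relabel:
            relabel[label] = len(relabel) + 1
        out.append(relabel[label])
    return out


# The three vectors from the example collapse to the same canonical form.
forms = [canonical(k) for k in ([1, 2, 1, 3], [4, 2, 4, 3], [2, 1, 2, 4])]
same_partition = forms[0] == forms[1] == forms[2]
```

Two vectors encode the same partition exactly when their canonical forms are equal.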

Let $\Gamma$ be the partition of $V$ that corresponds to the real community structure of
the underlying graph. An element $\gamma$ of $\Gamma$ is called a community. In total,
there are $|\Gamma|$ elements corresponding to $|\Gamma|$ communities. The purpose of
community detection is to build an estimate of the partition $\Gamma$ based on the topology
of the graph. Let $\Omega$ be the partition that represents our estimated community
structure. An element $\omega$ of $\Omega$ is called a cluster. Note that the number of
clusters $|\Omega|$ does not have to be equal to the number of communities $|\Gamma|$.

In the most general case, the output of our algorithm will be a vector $\kappa$ of length
$n$ which corresponds to our estimated partition $\Omega$ of the vertex set. We assume that
the number of communities $|\Gamma|$ is unknown and has to be estimated by the algorithm.



B. Distance Metric for Partitions 

The output of our algorithm is a partition $\Omega$ of the vertices in the graph. Comparing
a resulting clustering $\Omega$ with a given community structure $\Gamma$ of a network is
not straightforward, because it is not always clear which cluster corresponds to which
community and how to deal with mixed clusters which contain members of two or more
communities. Even the number of clusters $|\Omega|$ and the number of communities
$|\Gamma|$ may differ.

Two important issues regarding the evaluation of a clustering algorithm are the accuracy
and the precision of the algorithm. Accuracy is a measure of the success of an algorithm in
clustering the members of the same community together without any separation (i.e.
intra-cluster scatter). Precision is a measure of the success of an algorithm in creating
homogeneous clusters which contain the members of the same communities (i.e. inter-cluster
scatter).

In order to understand these concepts, consider two extreme cases. The partition of
singletons, where each vertex is a cluster by itself, that is $|\Omega| = n$, is very
precise since no cluster contains elements of more than one community. On the other hand,
it is very inaccurate since the elements of any community are scattered over many clusters.
The second extreme is the case where the partition is composed of a single cluster only,
$|\Omega| = 1$. The single cluster contains members of every community, so the partition is
very imprecise. But all the elements of any community are in the same cluster, which means
it is very accurate.



1. Variation of Information 

We decided to employ an information theoretical metric called variation of information $S$,
introduced in Ref. [8], which is specifically designed to compare the results of different
clusterings. By using $S$, it is possible to calculate a distance between two partitions.
Before proceeding further, let us define $S$ more precisely. Let $\Gamma$ and $\Omega$ be
two partitions of the set $V = \{v_1, v_2, \ldots, v_n\}$. Let
$\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_{|\Gamma|}\}$ be the set of communities and
$\Omega = \{\omega_1, \omega_2, \ldots, \omega_{|\Omega|}\}$ be the set of clusters.
Consider the partition $\Omega$ and a randomly picked element $v_i$ from $V$. Without any
other information, our uncertainty about which cluster of $\Omega$ the vertex $v_i$ is
assigned to is shaped by the distribution of the partition $\Omega$. For example, if all
vertices are assigned to the same cluster then there is no uncertainty. If each cluster
receives an equal number of vertices (homogeneous distribution) then the uncertainty is at
a maximum. To measure the uncertainty, we can use information entropy, which is a well
known metric of uncertainty. In Ref. [8], the entropy associated with a partition $\Omega$
is denoted by $H(\Omega)$ and defined as follows:



$$H(\Omega) = -\sum_{\omega \in \Omega} p(\omega) \log\left(p(\omega)\right) \qquad (2)$$



where $p(\omega) = |\omega| / n$ is the probability that a randomly chosen vertex is
assigned to the cluster $\omega$ in partition $\Omega$. The base of the logarithm is
irrelevant in our context; we employ the binary logarithm, so the unit of $H$ is the bit.
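A minimal sketch of Eq. (2), assuming a partition is given as a label vector in the representation of Section II A. The function name `partition_entropy` is chosen for this example only.

```python
from collections import Counter
from math import log2


def partition_entropy(labels):
    """H(Omega) in bits, with p(omega) = |omega| / n."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


h_one_cluster = partition_entropy([1, 1, 1, 1])  # no uncertainty
h_two_equal = partition_entropy([1, 1, 2, 2])    # two equally sized clusters
h_singletons = partition_entropy([1, 2, 3, 4])   # four equally sized clusters
```

For four vertices, the three cases give 0, 1 and 2 bits respectively, matching the two extremes described in the text.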

Now, imagine that we know the community partition $\Gamma$ of the same graph and we know
which community $\gamma$ the randomly picked vertex $v_i$ is assigned to in partition
$\Gamma$. This allows us to calculate the conditional entropy $H(\Omega|\Gamma)$ defined as



$$H(\Omega|\Gamma) = -\sum_{\omega \in \Omega} \sum_{\gamma \in \Gamma} p(\omega, \gamma) \log\left(p(\omega|\gamma)\right) \qquad (3)$$



which is the amount of entropy (i.e. uncertainty) remaining in $\Omega$ given our knowledge
about $\Gamma$. The joint probability $p(\omega, \gamma)$ is the probability that our
randomly selected vertex is assigned to cluster $\omega$ in partition $\Omega$ and to
community $\gamma$ in partition $\Gamma$. The conditional probability $p(\omega|\gamma)$ is
the probability that our randomly selected vertex is assigned to cluster $\omega$ in
partition $\Omega$ given that we know it is assigned to community $\gamma$ in partition
$\Gamma$. The conditional entropy $H(\Omega|\Gamma)$ is always non-negative. It is zero
when the knowledge about $\Gamma$ perfectly determines $\Omega$. $H(\Gamma|\Omega)$ is
defined similarly.

The variation of information S is defined as 



$$S(\Gamma, \Omega) = H(\Gamma|\Omega) + H(\Omega|\Gamma) \qquad (4)$$



We can use $S$ to combine precision and accuracy values into a single metric. The
conditional entropy $H(\Gamma|\Omega)$ is the amount of our uncertainty in $\Gamma$ given
our knowledge of $\Omega$. It can be used to measure the precision. If an algorithm assigns
each vertex to a different cluster in $\Omega$, then knowing $\Omega$ completely determines
$\Gamma$, so $H(\Gamma|\Omega) = 0$. The conditional entropy $H(\Omega|\Gamma)$, on the
other hand, can be used to calculate the accuracy of the algorithm. Lower values of $S$
correspond to less uncertainty (e.g. $S = 0$ means the two partitions are identical, hence
there is no uncertainty); higher values correspond to more uncertainty. Interested readers
may refer to Ref. [8] for further discussion of the metric. We should also note that just
before the writing of this paper was completed, we came across a recent study which also
uses $S$ as a metric [9].
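Equations (3) and (4) can be sketched in a few lines of Python. The function names and the label-vector inputs are assumptions of this example; the probabilities are estimated from counts, with $p(\omega, \gamma)$ taken as the fraction of vertices carrying both labels.

```python
from collections import Counter
from math import log2


def conditional_entropy(x, y):
    """H(X|Y) in bits for two label vectors over the same vertices."""
    n = len(x)
    joint = Counter(zip(x, y))  # counts of (x-label, y-label) pairs
    marg_y = Counter(y)         # counts of y-labels
    # p(a, b) = c / n and p(a | b) = c / marg_y[b]
    return -sum((c / n) * log2(c / marg_y[b]) for (a, b), c in joint.items())


def variation_of_information(gamma, omega):
    """S(Gamma, Omega) = H(Gamma|Omega) + H(Omega|Gamma)."""
    return conditional_entropy(gamma, omega) + conditional_entropy(omega, gamma)


gamma = [1, 1, 2, 2]       # two real communities of two vertices each
singletons = [1, 2, 3, 4]  # precise but inaccurate
one_cluster = [1, 1, 1, 1] # accurate but imprecise

precision_term = conditional_entropy(gamma, singletons)  # H(Gamma|Omega)
accuracy_term = conditional_entropy(singletons, gamma)   # H(Omega|Gamma)
```

In this toy case the singleton partition has $H(\Gamma|\Omega) = 0$ and $H(\Omega|\Gamma) = 1$ bit, the single-cluster partition has the two terms swapped, and a relabeled copy of `gamma` has $S = 0$.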



C. Genetic Algorithms 

Genetic algorithms (GA), as proposed in Ref. [10], are a set of optimization techniques
inspired by biological evolution. They have been applied successfully to a wide variety of
problems, including energy minimization [11], the traveling salesman problem [12, 13],
neural networks and cryptology [14], process scheduling [15] and various other optimization
processes [16].



In a typical GA, there is an objective function (called the fitness function) to be
optimized and a set of candidate solutions which are encoded as a kind of numerical
chromosome. At the start of the algorithm, one begins by 
generating a random population of candidate solutions. 
The candidates are evaluated by using the fitness func- 
tion. The next generation of candidate solutions is gener- 
ated by applying certain biologically inspired manipula- 
tions to the current pool of candidates and the solutions 
with higher fitness values have higher chances to be rep- 
resented in the next generation. Here the fitness function 
plays the role of reproductive fitness in Darwinian natural 
selection. Repeated rounds of fitness evaluation, repro- 
duction, and selection cause the initially random popula- 
tion of candidate solutions to evolve toward a population 
enriched in more optimal (in terms of fitness function) 
solutions. The main operations used to generate new 
potential solutions are analogues of point mutation (ran- 
dom changes to some part of the numerical chromosome) 
and crossing over (forming new chromosomes by combin- 
ing segments of existing ones) . Much of the skill in using 
this approach rests in setting up the relationship between 
the chromosomes and the parameters of the optimiza- 
tion problem in such a way that the evolution operations, 
point mutation and crossing over, generate better, or at 
least not substantially poorer, candidate solutions. Ge- 
netic algorithms are particularly attractive for problems 
such as combinatorial optimization, where the objective 
function has little or no smooth structure. GAs have 
the further charm that they require no arbitrary conver- 
gence criteria. They are not, however, parameter-free: 
one must choose, for example, population sizes, rates of 
mutation, and numbers of generations. 



III. THE ALGORITHM

It is possible to employ different kinds of genetic algorithms for a particular problem,
and for every implementation there will be several model parameters such as the number of
chromosomes, the rate of mutation, or the rate of crossing over, as we will explain later.
Unfortunately, the values of these parameters are not dictated by the problem at hand but
have to be set (somewhat arbitrarily) by us. Since our purpose in this study is to show
that genetic algorithms are a viable approach to the community detection problem, we
content ourselves with values found by trial and error. We will neither study the effect of
differing parameter values nor propose a general way to come up with "good" parameter
values. This kind of analysis is out of the scope of this paper.

We use $n$-vectors $\kappa$ as the chromosomes and the network modularity $Q$ as the
fitness function of our genetic algorithm. The population
$P = \{\kappa_1, \kappa_2, \ldots, \kappa_p\}$ is the set of all chromosomes. Note that the
population size $p = |P|$ is a model parameter.



A. Initial Population

Initially, for all chromosomes, each vertex is put in a different cluster. Thus the number
of clusters for each chromosome in the initial population is $n$. It is common practice to
give the genetic algorithm not a completely random starting point but a biased one in order
to speed up convergence. For this purpose, we employed a very simple heuristic. For
chromosome $\kappa_k$, we randomly pick a vertex $v_i$ and assign its cluster to all of its
neighbors (i.e. $\kappa_k^j \leftarrow \kappa_k^i$ whenever $(v_i, v_j) \in E$). We repeat
this operation $\alpha n$ times for each chromosome in the initial population, where
$\alpha$ is a model parameter; $\alpha = 0.4$ is used for the experiments reported in this
paper. This operation is extremely fast and results in small local communities. But the
resulting clusterings are still far from optimal, as we will see in the next section.



B. Main Loop

After the initial population is created, the main loop of the algorithm is repeated $g$
times. Since a new generation is obtained at each iteration of the loop, $g$ is called the
number of generations:

1. Apply the fitness function to the chromosomes.

2. Sort the chromosomes with respect to the fitness value and take the top $p$.

3. Save the top $\beta p$ of the chromosomes for later use.



4. Pair the sorted chromosomes (i.e. $\kappa_k$ with $\kappa_{k+1}$ whenever $k$ is odd,
assuming $p$ is even) and apply the crossover operation to the pairs.

5. Apply mutation.

6. Combine the newly obtained $p$ chromosomes and the previously saved $\beta p$.

Note that the last step is an elitist approach and ensures that the fitness scores of the
top $\beta p$ of the child generation will be at least as good as those of the parent
population. Note also that, except for the initial generation, the second step starts with
$(1 + \beta)p$ chromosomes and ends with $p$ chromosomes. Here $\beta$ and $g$ are model
parameters. We used $\beta = 0.1$ in this work. Now, let us focus on the crossing over and
mutation operators in detail.
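The seeding heuristic for the initial population and the six-step main loop can be sketched together as follows. This is an illustration under stated assumptions, not the authors' implementation: the graph is passed as an adjacency list `adj`, $\alpha n$ and $\beta p$ are rounded down, each sorted pair yields two children (one per crossover direction), and `mutate` is applied to every child. The `fitness`, `crossover` and `mutate` arguments are caller-supplied functions; all names are hypothetical.

```python
import random


def seed_chromosome(adj, alpha=0.4, rng=random):
    """Start from singletons, then alpha*n times copy the label of a random
    vertex onto all of its neighbours."""
    n = len(adj)
    kappa = list(range(n))  # every vertex in its own cluster
    for _ in range(int(alpha * n)):
        i = rng.randrange(n)
        for j in adj[i]:
            kappa[j] = kappa[i]
    return kappa


def evolve(population, fitness, crossover, mutate, g, beta=0.1):
    """Run g generations and return the fittest chromosome found."""
    p = len(population)
    elite_size = int(beta * p)
    for _ in range(g):
        # Steps 1-2: score the chromosomes, sort them and keep the top p.
        ranked = sorted(population, key=fitness, reverse=True)[:p]
        # Step 3: save copies of the top beta*p chromosomes.
        elite = [list(c) for c in ranked[:elite_size]]
        # Step 4: pair neighbours in sorted order and cross them over.
        # Assumption: both directions (source -> destination) are used.
        children = []
        for k in range(0, p - 1, 2):
            a, b = ranked[k], ranked[k + 1]
            children.append(crossover(a, b))
            children.append(crossover(b, a))
        # Step 5: mutation (assumption: applied to every child).
        children = [mutate(c) for c in children]
        # Step 6: merge the children with the saved elite.
        population = children + elite
    return max(population, key=fitness)


# Path graph 0-1-2-3 as a toy adjacency list.
adj = [[1], [0, 2], [1, 3], [2]]
start = seed_chromosome(adj, alpha=0.4, rng=random.Random(0))
```

Because the elite are copied before crossover and mutation, the best fitness in the population cannot decrease from one generation to the next, which is the elitist property noted above.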



C. Crossing over 

Traditionally, a crossover operator takes two chromosomes, merges them together and returns
two new chromosomes. A crossing over point in each of the chromosomes is selected, and all
the elements of the chromosomes after that point are exchanged between the two chromosomes.

Unfortunately, in our setting, the encoding of the chromosomes does not allow such a
straightforward crossing over operation. For each chromosome, the clusters to which the
vertices are assigned are represented by arbitrary integers, and the values in two
different chromosomes are not compatible, as discussed in Section II A.

Instead of employing a crossing over operation based on mutual exchange, we decided to
introduce a one-way crossing over operation. One of the chromosomes in the selected pair is
called the source chromosome $\kappa_{src}$ and the other is called the destination
chromosome $\kappa_{dest}$. The crossing over procedure is defined as follows. We pick a
vertex $v_i$ at random, determine its cluster (i.e. $\kappa_{src}^i$) in the source
chromosome and make sure that all the vertices in this cluster of the source chromosome are
also assigned to the same cluster in the destination chromosome (i.e.
$\kappa_{dest}^k \leftarrow \kappa_{src}^i$,
$\forall k \in \{k \mid \kappa_{src}^k = \kappa_{src}^i\}$).



The crossing over procedure is repeated $\eta n$ times on the chromosome. The crossing over
rate $\eta$ is a model parameter and is set to $\eta = 0.2$ for the experiments reported in
this paper. An example of a crossing over application is given in Table I. Note that as a
result of crossing over, $v_7$ ends up in the same community as $v_4$.

TABLE I: One-way crossing over when $v_4$ is selected
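The one-way crossing over described above can be sketched as follows. The function name, the list-based chromosomes, the rounding of $\eta n$ and the choice to return a modified copy of the destination are assumptions of this example.

```python
import random


def one_way_crossover(src, dst, eta=0.2, rng=random):
    """Return a copy of dst after int(eta * n) one-way transfers from src."""
    n = len(src)
    child = list(dst)
    for _ in range(int(eta * n)):
        i = rng.randrange(n)  # pick a vertex v_i at random
        label = src[i]        # its cluster in the source chromosome
        for k in range(n):
            if src[k] == label:
                child[k] = label  # copy that whole cluster into the child
    return child


class FixedPick:
    """Test helper that always selects vertex 3 (hypothetical)."""

    def randrange(self, n):
        return 3


src = [1, 1, 1, 2, 2, 2]
dst = [5, 6, 7, 8, 9, 9]
child = one_way_crossover(src, dst, eta=0.2, rng=FixedPick())
```

With vertex 3 selected, the source cluster `{3, 4, 5}` is imposed on the destination and the child becomes `[5, 6, 7, 2, 2, 2]`. Because cluster labels are arbitrary integers, a copied label may coincide with a label already used in the destination, in which case the two clusters are merged in the child.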

D. Mutation

We employ a point mutation operator defined as follows. We randomly pick a chromosome
$\kappa$ to be mutated. Then we pick two vertices $v_i$ and $v_j$ randomly. The cluster of
$v_j$ is set to the cluster of $v_i$ (i.e. $\kappa^j \leftarrow \kappa^i$). The mutation
procedure is repeated $\zeta n$ times, where the mutation rate $\zeta$ is a model parameter
and is set to $\zeta = 0.5$ for the experiments reported in this paper.
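A minimal sketch of the point mutation, assuming it is applied to one given chromosome and that $\zeta n$ is rounded down. How the chromosome to be mutated is selected is left to the caller; the function name and signature are hypothetical.

```python
import random


def mutate(kappa, zeta=0.5, rng=random):
    """Return a copy of kappa after int(zeta * n) point mutations."""
    n = len(kappa)
    child = list(kappa)
    for _ in range(int(zeta * n)):
        i = rng.randrange(n)
        j = rng.randrange(n)
        child[j] = child[i]  # v_j joins the cluster of v_i
    return child


mutated = mutate(list(range(10)), zeta=0.5, rng=random.Random(1))
```

Each point mutation only copies an existing label, so the operator can merge or reshape clusters but never introduces a new cluster label.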



IV. EXPERIMENTAL RESULTS 

Here we set several model parameters, such as the population size $p$, to seemingly
arbitrary values. We would like to stress that our aim in this paper is not to fine tune
the genetic algorithm but to provide a proof of concept that genetic algorithms are capable
of producing results comparable with those of previous algorithms. We evaluated our
algorithm on two well known datasets and a set of computer generated networks with known
community structures.



A. Zachary Karate Club 

The Zachary Karate Club network, which is one of the few data sets with known community
structure, was first analyzed in Ref. [17]. The network consists of 34 vertices and 78
edges. We ran our algorithm on this dataset 50 times, each for $g = 250$ generations with
the population size set to $p = 100$.

Although our algorithm does not know the number of communities, in all of the runs the
resulting partition $\Omega$ consisted of 2 clusters, as it ideally should. In 49 runs, the
clusters perfectly matched the real communities. In one run, one vertex was misplaced.



B. College Football Network 

The college football network is built from the college football matches played in the USA
in Division I during the year 2000 [18]. The vertices in the network are the college
football teams, and there is an edge between two teams if they played a match during the
season. The real community structure is given by the conferences that the teams belong to.
Teams tend to play more matches against teams in the same conference and fewer
inter-conference matches.

The dataset consists of 115 vertices with 12 communities (i.e. conferences). We ran the
algorithm on this network 10 times with $p = 200$ and
$g \in \{100, 200, 400, 800, 1600, 3200, 6400\}$. The $S$ scores obtained are presented
together with those of the fast algorithm for community detection (Fast Newman) in Fig. 1.
The score $S_{mean}$ is the mean of the $S$ scores we obtained from the 10 runs for each
$g$ value. Similarly, $S_{max}$ and $S_{min}$ are the maximum and minimum of these scores,
respectively.



C. Synthetic Networks 

It is not easy to find large datasets with known community structures. To evaluate our
algorithm further, we created a set of synthetic networks with known community structures.
The networks are based on a very simple network generation model used in Ref. [7]. They
consist of either 128 or 512 vertices. The average degree is set to 16 and the vertices are
assigned to 4 predetermined communities of equal size. There is a single parameter called
$z_{out}$ which regulates the average number of edges that a vertex makes with members of
other communities. If $z_{out} = 0$ then every vertex only has edges connecting it to
members of its own community. As we increase $z_{out}$, we obtain networks with weaker
community structures.

FIG. 1: $S$ versus $g$ for college football dataset.



When $z_{out} = 12$, the community structure in the topology is completely lost and the
network becomes a random network.
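One way to generate such networks is sketched below. The text does not spell out the construction, so this example assumes that each pair of vertices is linked independently, with probabilities chosen so that the expected within-community degree is $16 - z_{out}$ and the expected between-community degree is $z_{out}$. The function name and parameters are hypothetical.

```python
import random


def planted_partition(n=128, groups=4, degree=16, z_out=6, rng=random):
    """Return (edges, labels) for a graph with equal-sized planted communities."""
    size = n // groups
    labels = [v // size for v in range(n)]
    p_in = (degree - z_out) / (size - 1)  # link probability inside a community
    p_out = z_out / (n - size)            # link probability across communities
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if labels[u] == labels[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, labels


edges, labels = planted_partition(z_out=6, rng=random.Random(1))
```

Under this assumption, $n = 128$ and $z_{out} = 12$ give a within-community link probability of $4/31 \approx 0.129$ and a between-community probability of $12/96 = 0.125$, so the two kinds of pairs are almost equally likely to be linked, which is consistent with the network becoming random at that point.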

We would like to see the performance of our algorithm as a function of $z_{out}$. For this
purpose, we set the number of vertices $n = 128$ and created 100 networks for each value of
$z_{out} \in \{1, \ldots, 12\}$, that is 1200 networks in total.

FIG. 2: $S$ versus $z_{out}$ for synthetic networks with $n = 128$ and (a) $g = 750$, (b) $g = 3000$

We set the population size $p = 200$ and the number of generations $g = 750$. We ran our
algorithm 10 times on each network and calculated $S$ for each run, obtaining 10 $S$
values. We take the minimal $S_{min}$, the mean $S_{mean}$ and the maximal $S_{max}$ values
of the 10. Since we have 100 networks for the same $z_{out}$ value, we take the average of
$S_{min}$, $S_{mean}$ and $S_{max}$ over the 100 networks for a particular $z_{out}$.
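The aggregation of scores can be written compactly. In this sketch, `scores_by_network` is a hypothetical list holding, for each network with a given $z_{out}$, the $S$ values of its runs.

```python
from statistics import mean


def summarize(scores_by_network):
    """Average the per-network min, mean and max of S over all networks."""
    s_min = mean(min(runs) for runs in scores_by_network)
    s_mean = mean(mean(runs) for runs in scores_by_network)
    s_max = mean(max(runs) for runs in scores_by_network)
    return s_min, s_mean, s_max


# Two toy networks with three runs each.
summary = summarize([[1, 2, 3], [3, 4, 5]])
```

For the toy input the result is `(2, 3, 4)`: the per-network minima are 1 and 3, the means 2 and 4, and the maxima 3 and 5.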

The resulting scores are presented in Fig. 2(a) as a function of $z_{out}$. The scores
obtained by Fast Newman are also given for comparison [7]. We also include the scores of a
dummy algorithm which assigns each vertex randomly to one of the four clusters.

The small gap between the maximal and minimal $S$ scores suggests that the performance of
our algorithm is robust and does not change significantly from one run to another. The
scores of the genetic algorithm and Fast Newman increase when we increase $z_{out}$, as
expected because of the weakening community structure. Note that the $S_{mean}$ scores of
the genetic algorithm and the scores of Fast Newman are comparable for high values of
$z_{out}$, but Fast Newman significantly outperforms the genetic algorithm for lower values
of $z_{out}$. It seems that when the underlying community structure is strong (i.e. the
problem at hand is trivial) the genetic algorithm is unable to converge to a solution as
good as the one Fast Newman can find. When the community structure is weakened, the
difference between the two algorithms disappears. Whether this behavior is due to a lack of
fine tuning of the model parameters or is an intrinsic property of our genetic algorithm
calls for further investigation.

How does our algorithm respond to an increase in the number of generations? To give an
idea, we present Fig. 2(b), which contains results obtained in the same way but this time
with $g = 3000$ generations for each run. The results are qualitatively similar, but the
$S$ values of the genetic algorithm are lower in general. The improvement in the scores of
our algorithm suggests that it is possible to obtain better solutions by increasing the
number of generations.

FIG. 3: $S$ versus $g$ for synthetic networks with $n = 128$, $z_{out} = 6$



In order to analyze this point, we set $z_{out} = 6$ and calculated the scores for
differing numbers of generations. In Fig. 3 we see that the solutions converge after a
certain number of generations. The score of Newman's algorithm is shown as a flat line
since it does not depend on our model parameter $g$.

To examine our model on larger networks, we repeated the same set of experiments with
different numbers of generations ($z_{out} = 6$), but this time on networks with $n = 512$
vertices. In Fig. 4 we see that our algorithm still provides results comparable with (and
even better than) those of Fast Newman.

Note that, for the fast algorithm of Newman, we use our knowledge of the number of real
communities by cutting the dendrogram at just the right place, while the genetic algorithm
lacks this information and is still comparable with

FIG. 4: $S$ versus $g$ for synthetic networks with $n = 512$, $z_{out} = 6$



using genetic algorithm methods. The contribution of this study is the introduction of a
genetic algorithm for the community detection problem which does not require any
information about the number of communities in the network. The results are comparable with
those of previously introduced methods. Thus the employment of genetic algorithms for the
community detection problem is a viable approach.